The Diagonalized Newton Algorithm for Nonnegative Matrix Factorization

Author

  • Hugo Van hamme
Abstract

Non-negative matrix factorization (NMF) has become a popular machine learning approach to many problems in text mining, speech and image processing, bio-informatics and seismic data analysis, to name a few. In NMF, a matrix of non-negative data is approximated by the low-rank product of two matrices with non-negative entries. In this paper, the approximation quality is measured by the Kullback-Leibler divergence between the data and its low-rank reconstruction. The existence of the simple multiplicative update (MU) algorithm for computing the matrix factors has contributed to the success of NMF. Despite the availability of algorithms showing faster convergence, MU remains popular due to its simplicity. In this paper, a diagonalized Newton algorithm (DNA) is proposed showing faster convergence while the implementation remains simple and suitable for high-rank problems. The DNA algorithm is applied to various publicly available data sets, showing a substantial speed-up on modern hardware.

1 Introduction

Non-negative matrix factorization (NMF) denotes the process of factorizing an N×T data matrix V of non-negative real numbers into the product of an N×R matrix W and an R×T matrix H, where both W and H contain only non-negative real numbers. Taking a column-wise view of the data, i.e. each of the T columns of V is a sample of N-dimensional vector data, the factorization expresses each sample as a (weighted) addition of columns of W, which can hence be interpreted as the R parts that make up the data [1]. Hence, NMF can be used to learn data representations from samples. In [2], speaker representations are learnt from spectral data using NMF and subsequently applied to separate their signals. Another example in speech processing is [3] and [4], where phone representations are learnt using a convolutional extension of NMF. In [5], time-frequency representations reminiscent of formant traces are learnt from speech using NMF. In [6], NMF is used to learn acoustic representations for words in a vocabulary acquisition and recognition task. Applied to image processing, local features are learnt from examples with NMF in order to represent human faces in a detection task [7].

In this paper, the closeness of the reconstruction Z = WH to its target V is measured by their Kullback-Leibler divergence:

d_{KL}(V, Z) = \sum_{n,t} \left( v_{nt} \log \frac{v_{nt}}{z_{nt}} - v_{nt} + z_{nt} \right)    (1)

Given a data matrix V, the matrix factors W and H are then found by minimizing cost function (1), which yields the maximum likelihood estimate if the data are drawn from a Poisson distribution. The multiplicative updates (MU) algorithm proposed in [1] solves exactly this problem in an iterative manner. Its simplicity and the availability of many implementations make it a popular algorithm to date for solving NMF problems. However, there are some drawbacks to the algorithm. Firstly, it only converges locally and is not guaranteed to yield the global minimum of the cost function. It is hence sensitive to the choice of the initial guesses for W and H. Secondly, MU is very slow to converge. The goal of this paper is to speed up the convergence while retaining the local convergence property.
The resulting Diagonalized Newton Algorithm (DNA) uses only simple element-wise operations, such that its implementation requires only a few tens of lines of code, while memory requirements and the computational effort for a single iteration are about double those of an MU update.

The faster convergence rate is obtained by applying Newton's method to minimize d_KL(V,WH) over W and H in alternation. Newton updates have been explored for the Frobenius norm as a measure of the distance between V and Z in e.g. [8]-[13]. Specifically, in [11] a diagonal Newton method is applied to the Frobenius norm. For the Kullback-Leibler divergence, fewer studies are available. Since each optimization problem is multivariate, Newton updates typically imply solving sets of linear equations in each iteration. In [16], the Hessian is reduced by refraining from second order updates for the parameters close to zero. In [17], Newton updates are applied per coordinate, but in a cyclic order, which is troublesome for GPU implementations. In the proposed method, matrix inversion is avoided by diagonalizing the Hessian matrix. The resulting updates resemble the ones derived in [18] to the extent that they involve second order derivatives. An important difference is that [18] involves the non-negative k-residuals, hence requiring flooring to zero. Of course, the diagonal approximation may affect the convergence rate adversely. Also, Newton algorithms only show (quadratic) convergence when the estimate is sufficiently close to the local minimum and therefore need damping, e.g. Levenberg-Marquardt as in [14], or step size control as in [15] and [16]. In DNA, these convergence issues are addressed by computing both the MU and Newton solutions and selecting the one leading to the greatest reduction in d_KL(V,Z). Hence, since the cost is non-increasing under MU, it will also be non-increasing under DNA updates. This robust safety net can be constructed fairly efficiently because the quantities required to compute the MU have already been computed for the Newton update. The net result is that DNA iterations are only about two to three times as slow as MU iterations, both on a CPU and on a GPU. The experimental analysis shows that the increased convergence rate generally dominates over the increased cost per iteration, such that the overall balance is positive and can lead to speed-ups of up to a factor of 6.

2 NMF formulation

To induce sparsity on the matrix factors, the KL-divergence is often regularized, i.e. one seeks to minimize:

d_{KL}(V, WH) + \rho \sum_{n,r} w_{nr} + \lambda \sum_{r,t} h_{rt}    (2)

subject to non-negativity constraints on all entries of W and H. Here, ρ and λ are non-negative regularization parameters.

Minimizing the regularized KL-divergence (2) can be achieved by alternating updates of W and H for which the cost is non-increasing. The updates for this form of block coordinate descent are:

H \leftarrow \arg\min_{H' \geq 0} \; d_{KL}(V, W H') + \lambda \sum_{r,t} h'_{rt}    (3)

W \leftarrow \arg\min_{W' \geq 0} \; d_{KL}(V, W' H) + \rho \sum_{n,r} w'_{nr}    (4)

Because of the symmetry property d_KL(V, WH) = d_KL(V^t, H^t W^t), where superscript t denotes matrix transpose, it suffices to consider only the update of H. Furthermore, because of the summation over all columns in (1), minimization (3) splits up into T independent optimization problems.
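Before the per-column problem is worked out, the overall alternating structure can be summarized in a short sketch. The NumPy fragment below is illustrative only (the function and argument names are mine, not the paper's MATLAB code): it shows how a single column-update routine, whether MU or DNA, can be reused for the W update through the transposition symmetry.

```python
import numpy as np

def alternate(V, W, H, update_H, n_iter=100, rho=0.0, lam=0.0):
    """Block coordinate descent of (3)-(4).

    `update_H` is any routine that does not increase the regularized
    KL-divergence of H for fixed W (e.g. the MU update (11) or the DNA
    update of Algorithm 1).  The W update reuses the same routine on the
    transposed problem, since d_KL(V, WH) = d_KL(V^T, H^T W^T).
    """
    for _ in range(n_iter):
        H = update_H(V, W, H, lam)            # minimization (3): update H, W fixed
        W = update_H(V.T, H.T, W.T, rho).T    # minimization (4): update W, H fixed
    return W, H
```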
Let v denote any column of V and let h denote the corresponding column of H; then the following is the core minimization problem to be considered:

h \leftarrow \arg\min_{h' \geq 0} \; d_{KL}(v, W h') + \lambda \, \mathbf{1}^t h'    (5)

where 1 denotes a vector of ones of appropriate length. The solution of (5) should satisfy the KKT conditions, i.e. for all r with h_r > 0:

\frac{\partial}{\partial h_r} \left( d_{KL}(v, Wh) + \lambda \mathbf{1}^t h \right) = -\sum_n w_{nr} \frac{v_n}{(Wh)_n} + \sum_n w_{nr} + \lambda = 0

where h_r denotes the r-th component of h. If h_r = 0, the partial derivative is positive. Hence the product of h_r and the partial derivative is always zero for a solution of (5), i.e. for r = 1 ... R:

h_r \left( \sum_n w_{nr} \frac{v_n}{(Wh)_n} - \left( \sum_n w_{nr} + \lambda \right) \right) = 0    (6)

Since W-columns with all zeros do not contribute to Z, it can be assumed that the column sums of W are non-zero, so the above can be recast as:

h_r \left( \frac{\sum_n w_{nr} q_n}{\sum_n w_{nr} + \lambda} - 1 \right) = 0

where q_n = v_n / (Wh)_n. To facilitate the derivations below, the following notation is introduced:

a_r = \frac{\sum_n w_{nr} q_n}{\sum_n w_{nr} + \lambda} - 1    (7)

which is a function of h via q. The KKT conditions are hence recast as [20]

a_r h_r = 0 \quad \text{for } r = 1 \ldots R    (8)

Finally, summing (6) over r yields

\sum_r \left( \sum_n w_{nr} + \lambda \right) h_r = \sum_n v_n    (9)

which can be satisfied for any guess h by renormalizing:

h_r \leftarrow h_r \; \frac{\mathbf{1}^t v}{\sum_s \left( \sum_n w_{ns} + \lambda \right) h_s}    (10)

2.1 Multiplicative updates

For the more generic class of Bregman divergences, it was shown in a.o. [20] that the cost is non-increasing under multiplicative updates (MU) of W and H. For the KL-divergence, MU is identical to a fixed point update of (6), i.e.

h_r \leftarrow h_r (1 + a_r) = h_r \; \frac{\sum_n w_{nr} q_n}{\sum_n w_{nr} + \lambda}    (11)

Update (11) has two fixed points: h_r = 0 and a_r = 0. In the former case, the KKT conditions imply that a_r is negative.

2.2 Newton updates

To find the stationary points of (2), the R equations (8) need to be solved for h. In general, let g(h) be an R-dimensional vector function of an R-dimensional variable h. Newton's update then states:

h \leftarrow h - (\nabla g)^{-1} g(h) \quad \text{with} \quad (\nabla g)_{rl} = \frac{\partial g_r(h)}{\partial h_l}    (12)

Applied to equations (8):

(\nabla g)_{rl} = a_r \delta_{rl} - \frac{h_r}{\sum_n w_{nr} + \lambda} \sum_n w_{nr} w_{nl} \frac{v_n}{(Wh)_n^2}    (13)

where δ_rl is Kronecker's delta. To avoid the matrix inversion in update (12), the last term in equation (13) is diagonalized, which is equivalent to solving the r-th equation in (8) for h_r with all other components fixed. With

b_r = \frac{1}{\sum_n w_{nr} + \lambda} \sum_n w_{nr}^2 \frac{v_n}{(Wh)_n^2}    (14)

which is always positive, an element-wise Newton update for h is obtained:

h_r \leftarrow h_r \; \frac{h_r b_r}{h_r b_r - a_r}    (15)

Notice that this update does not automatically satisfy (9), so updates should be followed by a renormalization (10). One needs to pay attention to the fact that Newton updates are attracted towards both local minima and local maxima. Like for the MU update, h_r = 0 and a_r = 0 are the only fixed points of update (15), which are now shown to be locally stable. In case the optimizer is at h_r = 0, a_r is negative by the KKT conditions, and update (15) will indeed decrease h_r. In a sufficiently small neighborhood of a point where the gradient vanishes, i.e. a_r = 0, update (15) will increase (decrease) h_r if and only if (11) increases (decreases) its estimate. Since (11) never increases the cost, update (15) attracts to a minimum.

However, this only guarantees local convergence for per-element updates, and Newton methods are known to suffer from potentially small convergence regions. This also applies to update (15), which can indeed result in limit cycles in some cases. In the next subsections, two measures are taken to respectively increase the convergence region and to make the update non-increasing.
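As a concrete illustration of equations (7), (10) and (11), a NumPy sketch of the quantities a_r, the MU step and the renormalization might look as follows; the helper names and the small eps guard against division by zero are mine, not the paper's.

```python
import numpy as np

def compute_a(V, W, H, lam=0.0, eps=1e-12):
    """Matrix of a_rt values from equation (7), one column per column of V."""
    Q = V / np.maximum(W @ H, eps)            # q_n = v_n / (Wh)_n, element-wise
    wsum = W.sum(axis=0)[:, None] + lam       # sum_n w_nr + lambda, shape (R, 1)
    return (W.T @ Q) / wsum - 1.0

def mu_update(V, W, H, lam=0.0):
    """Multiplicative update (11): h_r <- h_r * (1 + a_r)."""
    return H * (1.0 + compute_a(V, W, H, lam))

def renormalize(V, W, H, lam=0.0, eps=1e-12):
    """Column-wise rescaling (10) so that the sum constraint (9) holds."""
    wsum = W.sum(axis=0)[:, None] + lam
    scale = V.sum(axis=0) / np.maximum((wsum * H).sum(axis=0), eps)
    return H * scale
```

With these helpers, the element-wise Newton step (15) only needs the additional curvature term b_r of (14), as sketched after Algorithm 1 below.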
2.3 Step size limitation

When a_r is positive, update (15) may not be well-behaved in the sense that its denominator can become negative or zero. To respect non-negativity and to avoid the singularity, the multiplicative gain is bounded below by a function with the same local behavior around zero:

\frac{h_r b_r}{h_r b_r - a_r} = \frac{1}{1 - \frac{a_r}{h_r b_r}} \geq 1 + \frac{a_r}{h_r b_r}    (16)

Hence, if a_r ≥ 0, the following update is used:

h_r \leftarrow h_r \left( 1 + \frac{a_r}{h_r b_r} \right) = h_r + \frac{a_r}{b_r}    (17)

Finally, step sizes are further limited by flooring resp. ceiling the multiplicative gain applied to h_r in updates (15) and (17) (see Algorithm 1, steps 11 and 24, for details).

2.4 Non-increase of the cost

Despite the measures taken in section 2.3, the divergence can still increase under the Newton update. A very safe option is to additionally compute the MU update and to compare the cost function values of both updates. If the MU update is better, the Newton update is rejected and the MU update is taken instead. This guarantees non-increase of the cost function. The computational cost of this operation is dominated by evaluating the KL-divergence, not by computing the update itself.

3 The Diagonalized Newton Algorithm for KLD-NMF

In Algorithm 1, the arguments given above are joined to form the Diagonalized Newton Algorithm (DNA) for NMF with a Kullback-Leibler divergence cost. MATLAB code is available from www.esat.kuleuven.be/psi/spraak/downloads, both for the case where V is sparse and where it is dense.

Algorithm 1: pseudocode for the DNA KLD-NMF algorithm. ⊘ and ⊙ denote element-wise division and multiplication respectively, and [x]_ε = max(x, ε). Steps not labelled “MU” are the additional code required for DNA.

Input: data V, initial guesses for W and H, regularization weights ρ and λ.
MU Step 1: divide the r-th column of W by Σ_n w_nr + λ; multiply the r-th row of H by the same number.
MU Step 2: Z = W H
MU Step 3: Q = V ⊘ Z
Repeat until convergence:
  MU Step 4: precompute W ⊙ W
  MU Step 5: A = W^t Q − 1
  MU Step 6: H_MU = H + A ⊙ H
  MU Step 7: Z_MU = W H_MU
  MU Step 8: Q_MU = V ⊘ Z_MU
  MU Step 9: d_MU = 1^t (V ⊙ log Q_MU)
  Step 10: Q2 = V ⊘ (Z ⊙ Z); B = (W ⊙ W)^t Q2
  Step 11: H_DNA = H ⊙ (H ⊙ B) ⊘ (H ⊙ B − A) for the entries for which A < 0 (gain floored, cf. section 2.3);
           H_DNA = H + A ⊘ B for the entries for which A ≥ 0 (gain ceiled, cf. section 2.3);
           multiply the t-th column of H_DNA by the t-th entry of (1^t V) ⊘ (1^t H_DNA)
  Step 12: Z_DNA = W H_DNA
  Step 13: Q_DNA = V ⊘ Z_DNA
  Step 14: d_DNA = 1^t (V ⊙ log Q_DNA)
  Step 15: copy H, Z and Q from:
           H_DNA, Z_DNA and Q_DNA for the columns for which d_DNA < d_MU
           H_MU, Z_MU and Q_MU for the columns for which d_DNA ≥ d_MU
  MU Step 16: divide (multiply) the r-th row (column) of H (W) by Σ_t h_rt + ρ.
  Step 17: precompute H ⊙ H
  MU Step 18: A = Q H^t − 1
  MU Step 19: W_MU = W + A ⊙ W
  MU Step 20: Z_MU = W_MU H
  MU Step 21: Q_MU = V ⊘ Z_MU
  MU Step 22: d_MU = (V ⊙ log Q_MU) 1
  Step 23: Q2 = V ⊘ (Z ⊙ Z); B = Q2 (H ⊙ H)^t
  Step 24: W_DNA = W ⊙ (W ⊙ B) ⊘ (W ⊙ B − A) for the entries for which A < 0 (gain floored, cf. section 2.3);
           W_DNA = W + A ⊘ B for the entries for which A ≥ 0 (gain ceiled, cf. section 2.3);
           multiply the n-th row of W_DNA by the n-th entry of (V 1) ⊘ (W_DNA 1)
  Step 25: Z_DNA = W_DNA H
  Step 26: Q_DNA = V ⊘ Z_DNA
  Step 27: d_DNA = (V ⊙ log Q_DNA) 1
  Step 28: copy W, Z and Q from:
           W_DNA, Z_DNA and Q_DNA for the rows for which d_DNA < d_MU
           W_MU, Z_MU and Q_MU for the rows for which d_DNA ≥ d_MU
  MU Step 29: divide (multiply) the r-th column (row) of W (H) by Σ_n w_nr + λ.

Notice that steps 9, 14, 22 and 27 require some care for the zeros in V, which should not contribute to the cost.
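To make the H half of Algorithm 1 concrete, the sketch below implements the structure of steps 4-15 in dense NumPy under simplifying assumptions: the columns of W are assumed to be normalized as in step 1, the extra flooring/ceiling of the multiplicative gain in step 11 is omitted, and the eps guards and function names are illustrative rather than taken from the paper's MATLAB code.

```python
import numpy as np

def dna_update_H(V, W, H, eps=1e-12):
    """One DNA update of H with the MU safety net.

    Assumes the columns of W have been scaled so that sum_n w_nr + lambda = 1,
    as after step 1 of Algorithm 1.
    """
    Z = np.maximum(W @ H, eps)
    A = W.T @ (V / Z) - 1.0                        # step 5, equation (7)
    B = (W * W).T @ (V / (Z * Z))                  # step 10, equation (14)

    H_mu = H * (1.0 + A)                           # step 6, MU update (11)

    HB = H * B
    newton_gain = HB / np.where(A < 0.0, HB - A, 1.0)   # safe denominator where unused
    H_dna = np.where(A < 0.0,
                     H * newton_gain,              # step 11, Newton step (15)
                     H + A / np.maximum(B, eps))   # step 11, damped step (17)
    # renormalization (10); with normalized W columns the denominator is 1^t H
    H_dna *= V.sum(axis=0) / np.maximum(H_dna.sum(axis=0), eps)

    def per_column_cost(Hc):
        # steps 9/14: only the v*log(v/(Wh)) term is needed for the comparison
        Zc = np.maximum(W @ Hc, eps)
        return np.where(V > 0, V * np.log(np.maximum(V, eps) / Zc), 0.0).sum(axis=0)

    keep_dna = per_column_cost(H_dna) < per_column_cost(H_mu)   # step 15 safety net
    return np.where(keep_dna, H_dna, H_mu)
```

The W half (steps 16-28) follows by applying the same routine to the transposed problem, as in the alternating outer loop sketched in section 2.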
In terms of complexity, the most expensive steps are the computation of A, B, Z_MU and Z_DNA, which require O(NRT) operations. All other steps require O(NT), O(NR) or O(RT) operations. Hence, it is expected that a DNA iteration is about twice as slow as an MU iteration. On modern hardware, parallelization may however distort this picture, and hence experimental verification is required.
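When monitoring convergence, the full divergence (1) is needed rather than the per-column log terms used inside the iteration. A small sketch of that evaluation, with the zero entries of V excluded from the log term as required above, is given below (the function name and eps guard are illustrative, not the paper's code).

```python
import numpy as np

def kl_divergence(V, Z, eps=1e-12):
    """Generalized KL divergence of equation (1), skipping log terms where v_nt = 0."""
    mask = V > 0
    d = np.sum(Z) - np.sum(V)                      # sum over n,t of (z_nt - v_nt)
    d += np.sum(V[mask] * np.log(V[mask] / np.maximum(Z[mask], eps)))
    return d
```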


Journal: CoRR, volume abs/1301.3389, 2013.